Abstract

Recognizing textual entailment is a challenging problem and a
fundamental component of many applications in natural language
processing. We present a novel framework for recognizing textual
entailment that focuses on the use of syntactic heuristics to
recognize false entailment.

We give a thorough analysis of our system, which demonstrates
state-of-the-art performance on a widely-used test set.

1  Introduction 

Recognizing the semantic equivalence of two fragments of text is a
fundamental component of many applications in natural language
processing. Recognizing textual entailment, as formulated in the
recent PASCAL Challenge¹, is the problem of determining whether some
text sentence T entails some hypothesis sentence H.

The motivation for this formulation was to isolate and evaluate the
application-independent component of semantic inference shared across
many application areas, reflected in the division of the PASCAL RTE
dataset into seven distinct tasks: Information Extraction (IE),
Comparable Documents (CD), Reading Comprehension (RC), Machine
Translation (MT), Information Retrieval (IR), Question Answering (QA),
and Paraphrase Acquisition (PP).

1 http://www.pascal-network.org/Challenges/RTE. The examples given
throughout this paper are from the first PASCAL RTE dataset,
described in Section 6.

The RTE problem as presented in the PASCAL RTE dataset is
particularly attractive in that it is a reasonably simple task for
human annotators with high inter-annotator agreement (95.1% in one
independent labeling (Bos and Markert, 2005)), but an extremely
challenging task for automated systems. The highest-accuracy systems
on the RTE test set are still much closer in performance to a random
baseline accuracy of 50% than to the inter-annotator agreement. For
example, two high-accuracy systems are those described in (Tatu and
Moldovan, 2005), achieving 60.4% accuracy with no task-specific
information, and (Bos and Markert, 2005), which achieves 61.2%
task-dependent accuracy, i.e. when able to use the specific task
labels as input.

Previous systems for RTE have attempted a wide variety of strategies.
Many previous approaches have used a logical form representation of
the text and hypothesis sentences, focusing on deriving a proof by
which one can infer the hypothesis logical form from the text logical
form (Bayer et al., 2005; Bos and Markert, 2005; Raina et al., 2005;
Tatu and Moldovan, 2005). These papers often note that a major
obstacle to accurate theorem proving for the task of textual
entailment is the lack of world knowledge, which is frequently
difficult and costly to obtain and encode. Attempts have been made to
remedy this deficit through various techniques, including
model-building (Bos and Markert, 2005) and the addition of semantic
axioms (Tatu and Moldovan, 2005).

Our system diverges from previous approaches most strongly by
focusing upon false entailments; rather than assuming that a given
entailment is false until proven true, we make the opposite
assumption, and instead focus on applying knowledge-free heuristics
that can act locally on a subgraph of syntactic dependencies to
determine with high confidence that the entailment is false. Our
approach is inspired by an analysis of the RTE dataset that suggested
a syntax-based approach should be approximately twice as effective at
predicting false entailment as true entailment (Vanderwende and Dolan,
2006). The analysis implied that a great deal of syntactic
information remained unexploited by existing systems, but gave few
explicit suggestions on how syntactic information should be applied;
this paper provides a starting point for creating the heuristics
capable of obtaining the bound they suggest².

2  System  Description 


Similar to most other syntax-based approaches to recognizing textual
entailment, we begin by representing each text and hypothesis
sentence pair as logical forms. These logical forms are generated
using NLPWin³, a robust system for natural language parsing and
generation (Heidorn, 2000). Our logical form representation may be
considered equivalently as a set of triples of the form
RELATION(node_i, node_j), or as a graph of syntactic dependencies; we
use both terminologies interchangeably. Our algorithm proceeds as
follows:

1. Parse each sentence with the NLPWin parser, resulting in syntactic
dependency graphs for the text and hypothesis sentences.

2. Attempt an alignment of each content node in the dependency graph
of the hypothesis sentence to some node in the graph of the text
sentence, using a set of heuristics for alignment (described in
Section 3).

3. Using the alignment, apply a set of syntactic heuristics for
recognizing false entailment (described in Section 4); if any match,
predict that the entailment is false.

4. If no syntactic heuristic matches, back off to a lexical
similarity model (described in Section 5.1), with an attempt to align
detected paraphrases (described in Section 5.2).

Figure 1: Logical form produced by NLPWin for the sentence “Six
hostages in Iraq were freed.” [Figure not reproduced; the surviving
node labels are a node with lemma “free”, pos Verb, and a node with
lemma “X”, pos Pron.]
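The control flow of the four steps can be sketched as follows. This is a minimal illustration rather than the system's implementation: the function name `predict_entailment`, the callables passed in for parsing, alignment, and scoring, and the choice of `>=` when comparing the back-off score to the threshold are all assumptions made for the example.

```python
from typing import Any, Callable, Iterable


def predict_entailment(
    text: str,
    hypothesis: str,
    parse: Callable[[str], Any],
    align: Callable[[Any, Any], Any],
    heuristics: Iterable[Callable[[Any, Any, Any], bool]],
    backoff_score: Callable[[Any, Any, Any], float],
    threshold: float,
) -> bool:
    """Return True if the text is predicted to entail the hypothesis."""
    # Step 1: parse both sentences into dependency graphs.
    t_graph = parse(text)
    h_graph = parse(hypothesis)
    # Step 2: align hypothesis nodes to text nodes.
    alignment = align(h_graph, t_graph)
    # Step 3: any matching false-entailment heuristic rejects the pair.
    for heuristic in heuristics:
        if heuristic(h_graph, t_graph, alignment):
            return False
    # Step 4: otherwise back off to a similarity score and a threshold
    # (the comparison direction is an assumption of this sketch).
    return backoff_score(h_graph, t_graph, alignment) >= threshold
```

Passing the components in as callables keeps the sketch independent of any particular parser; in the paper those roles are played by NLPWin and the heuristics of Sections 3 to 5.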

2 (Vanderwende and Dolan, 2006) suggest that the truth or falsehood
of 48% of the entailment examples in the RTE test set could be
correctly identified via syntax and a thesaurus alone; thus by random
guessing on the rest of the examples one might hope for an accuracy
level of 0.48 + 0.52/2 = 74%.

3 To aid in the replicability of our experiments, we have published
the NLPWin logical forms for all sentences from the development and
test sets in the PASCAL RTE dataset at
http://research.microsoft.com/nlp/Projects/RTE.aspx.


In addition to the typical syntactic information provided by a
dependency parser, the NLPWin parser provides an extensive number of
semantic features obtained from various linguistic resources,
creating a rich environment for feature engineering. For example,
Figure 1 (from Dev Ex. #616) illustrates the dependency graph
representation we use, demonstrating the stemming, part-of-speech
tagging, syntactic relationship identification, and semantic feature
tagging capabilities of NLPWin.

We define a content node to be any node whose lemma is not on a small
stoplist of common stop words. In addition to content vs. non-content
nodes, among content nodes we distinguish between entities and
nonentities: an entity node is any node classified by the NLPWin
parser as being a proper noun, quantity, or time.

Each of the features of our system was developed from inspection of
sentence pairs from the RTE development data set, and used in the
final system only if it improved the system's accuracy on the
development set (or improved F-score if accuracy was unchanged);
sentence pairs in the RTE test set were left uninspected and used for
testing purposes only.

3  Linguistic  cues  for  node  alignment 

Our syntactic heuristics for recognizing false entailment rely
heavily on the correct alignment of words and multiword units between
the text and hypothesis logical forms. In the notation below, we will
consider h and t to be nodes in the hypothesis H and text T logical
forms, respectively. To accomplish the task of node alignment we rely
on the following heuristics:

Figure 2: Example of synonym, value, and derivational form alignment
heuristics, Dev Ex. #767

3.1  WordNet  synonym  match 

As in (Herrera et al., 2005) and others, we align a node h ∈ H to any
node t ∈ T that both has the same part of speech and belongs to the
same synset in WordNet. Our alignment considers multiword units,
including compound nouns (e.g., we align “Oscar” to “Academy Award”
as in Figure 2), as well as verb-particle constructions such as “set
off” (aligned to “trigger” in Test Ex. #1983).

3.2  Numeric  value  match 

The NLPWin parser assigns a normalized numeric value feature to each
piece of text inferred to correspond to a numeric value; this allows
us to align “6th” to “sixth” in Test Ex. #1175, and to align “a
dozen” to “twelve” in Test Ex. #1231.

3.3  Acronym  match 

Many acronyms are recognized using the synonym match described above;
nonetheless, many acronyms are not yet in WordNet. For these cases we
have a specialized acronym match heuristic which aligns pairs of
nodes with the following properties: if the lemma for some node h
consists only of capitalized letters (with possible intervening
periods), and the letters correspond to the first characters of some
multiword lemma for some t ∈ T, then we consider h and t to be
aligned. This heuristic allows us to align “UNDP” to “United Nations
Development Programme” in Dev Ex. #357 and “ANC” to “African National
Congress” in Test Ex. #1300.
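A minimal sketch of the acronym test described above. The function name `is_acronym_of` is hypothetical, and the sketch assumes that every word of the multiword lemma contributes exactly one initial (so function words are not skipped), which the text does not specify.

```python
def is_acronym_of(h_lemma: str, t_lemma: str) -> bool:
    """True if h_lemma is an all-capitals acronym of the multiword t_lemma."""
    # Drop periods so that "U.N." is treated like "UN".
    letters = h_lemma.replace(".", "")
    # The hypothesis lemma must consist only of capital letters.
    if not letters or not letters.isalpha() or not letters.isupper():
        return False
    words = t_lemma.split()
    # Assumption: one initial per word, and the text lemma is multiword.
    if len(words) < 2 or len(words) != len(letters):
        return False
    return all(word[0].upper() == letter for word, letter in zip(words, letters))
```

With this definition, `is_acronym_of("UNDP", "United Nations Development Programme")` and `is_acronym_of("ANC", "African National Congress")` both hold, matching the two examples cited above.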


3.4  Derivational  form  match 

We would like to align words which have the same root form (or have a
synonym with the same root form) and which possess similar semantic
meaning, but which may belong to different syntactic categories. We
perform this by using a combination of the synonym and
derivationally-related form information contained within WordNet.
Explicitly, our procedure for constructing the set of
derivationally-related forms for a node h is to take the union of all
derivationally-related forms of all the synonyms of h (including h
itself), i.e.:

$$\mathrm{DERIV}(h) = \bigcup_{s \in \mathrm{WN\text{-}SYN}(h)} \mathrm{WN\text{-}DERIV}(s)$$

In addition to the noun/verb derivationally-related forms, we detect
adjective/adverb derivationally-related forms that differ only by the
suffix ‘ly’.

Unlike the previous alignment heuristics, we do not expect that two
nodes aligned via derivationally-related forms will play the same
syntactic role in their respective sentences. Thus we consider two
nodes aligned in this way to be soft-aligned, and we do not attempt
to apply our false entailment recognition heuristics to nodes aligned
in this way.
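The construction of DERIV(h) can be sketched with plain dictionaries standing in for WordNet. The names `deriv`, `ly_related`, and `soft_aligned`, and the dictionary-based lookup tables, are assumptions of this example; a real system would query WordNet instead.

```python
def deriv(h, wn_syn, wn_deriv):
    """Union of derivationally-related forms over h and its synonyms."""
    forms = set()
    # WN-SYN(h) is taken to include h itself, as in the text.
    for s in {h} | set(wn_syn.get(h, ())):
        forms |= set(wn_deriv.get(s, ()))
    return forms


def ly_related(a, b):
    """Adjective/adverb pairs that differ only by the suffix 'ly'."""
    return a + "ly" == b or b + "ly" == a


def soft_aligned(h, t, wn_syn, wn_deriv):
    """Soft alignment via derivational forms or the 'ly' rule."""
    return t in deriv(h, wn_syn, wn_deriv) or ly_related(h, t)
```

For instance, with the toy tables `wn_syn = {"win": ["gain"]}` and `wn_deriv = {"win": ["winner"], "gain": ["gainer"]}` (made-up entries), `deriv("win", wn_syn, wn_deriv)` contains both "winner" and "gainer".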

3.5  Country  adjectival  form  /  demonym  match 

As a special case of derivational form match, we soft-align matches
from an explicit list of place names, adjectival forms, and
demonyms⁴; e.g., “Sweden” and “Swedish” in Test Ex. #1576.

3.6  Other  heuristics  for  alignment 

In addition to these heuristics, we implemented a hyponym match
heuristic similar to that discussed in (Herrera et al., 2005), and a
heuristic based on the string-edit distance of two lemmas; however,
these heuristics yielded a decrease in our system's accuracy on the
development set and were thus left out of our final system.

4  Recognizing  false  entailment 

The bulk of our system focuses on heuristics for recognizing false
entailment. For purposes of notation, we define binary functions for
the existence of each semantic node feature recognized by NLPWin;
e.g., if h is negated, we state that NEG(h) = TRUE. Similarly we
assign binary functions for the existence of each syntactic relation
defined over pairs of nodes. Finally, we define the function
ALIGN(h, t) to be true if and only if the node h ∈ H has been
‘hard-aligned’ to the node t ∈ T using one of the heuristics in
Section 3. Other notation is defined in the text as it is used.
Table 1 summarizes all heuristics used in our final system to
recognize false entailment.

Unaligned Entity:
$$\mathrm{ENTITY}(h) \land \forall t.\ \lnot\mathrm{ALIGN}(h,t) \rightarrow \mathrm{False}$$

Negation Mismatch:
$$\mathrm{ALIGN}(h,t) \land \mathrm{NEG}(t) \neq \mathrm{NEG}(h) \rightarrow \mathrm{False}$$

Modal Mismatch:
$$\mathrm{ALIGN}(h,t) \land \mathrm{MOD}(t) \land \lnot\mathrm{MOD}(h) \rightarrow \mathrm{False}$$

Antonym Match:
$$\mathrm{ALIGN}(h_1,t_1) \land \mathrm{REL}(h_0,h_1) \land \mathrm{REL}(t_0,t_1) \land \mathrm{LEMMA}(t_0) \in \mathrm{ANTONYMS}(h_0) \rightarrow \mathrm{False}$$

Argument Movement:
$$\mathrm{ALIGN}(h_1,t_1) \land \mathrm{ALIGN}(h_2,t_2) \land \mathrm{REL}(h_1,h_2) \land \lnot\mathrm{REL}(t_1,t_2) \land \mathrm{REL} \in \{\mathrm{SUBJ}, \mathrm{OBJ}, \mathrm{IND}\} \rightarrow \mathrm{False}$$

Superlative Mismatch:
$$\lnot\big(\mathrm{SUPR}(h_1) \rightarrow (\mathrm{ALIGN}(h_1,t_1) \land \mathrm{ALIGN}(h_2,t_2) \land \mathrm{REL}_1(h_2,h_1) \land \mathrm{REL}_1(t_2,t_1) \land \forall t_3.(\mathrm{REL}_2(t_2,t_3) \land \mathrm{REL}_2 \in \{\mathrm{MOD}, \mathrm{POSSR}, \mathrm{LOCN}\} \rightarrow \mathrm{REL}_2(h_2,h_3) \land \mathrm{ALIGN}(h_3,t_3)))\big) \rightarrow \mathrm{False}$$

Conditional Mismatch:
$$\mathrm{ALIGN}(h_1,t_1) \land \mathrm{ALIGN}(h_2,t_2) \land \mathrm{COND} \in \mathrm{PATH}(t_1,t_2) \land \mathrm{COND} \notin \mathrm{PATH}(h_1,h_2) \rightarrow \mathrm{False}$$

Table 1: Summary of heuristics for recognizing false entailment

4 List of adjectival forms and demonyms based on the list at:
http://en.wikipedia.org/wiki/List_of_demonyms

4.1  Unaligned  entity 

If  some  node  h  has  been  recognized  as  an  entity  (i.e., 
as  a  proper  noun,  quantity,  or  time)  but  has  not  been 
aligned  to  any  node  t,  we  predict  that  the  entailment 
is  false.  For  example,  we  predict  that  Test  Ex.  #1863 
is  false  because  the  entities  “Suwariya”,  “20  miles”, 
and  “35”  in  H  are  unaligned. 

4.2  Negation  mismatch 

If any two nodes (h, t) are aligned, and one (and only one) of them
is negated, we predict that the entailment is false. Negation is
conveyed by the NEG feature in NLPWin. This heuristic allows us to
predict false entailment in the example “Pertussis is not very
contagious” and “...pertussis, is a highly contagious bacterial
infection” in Test Ex. #1144.

4.3  Modal  auxiliary  verb  mismatch 

If any two nodes (h, t) are aligned, and t is modified by a modal
auxiliary verb (e.g., can, might, should, etc.) but h is not
similarly modified, we predict that the entailment is false.
Modification by a modal auxiliary verb is conveyed by the MOD feature
in NLPWin. This heuristic allows us to predict false entailment
between the text phrase “would constitute a threat to democracy”, and
the hypothesis phrase “constitutes a democratic threat” in Test Ex.
#1203.
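The three node-level heuristics of Sections 4.1 to 4.3 can be sketched as simple predicates over an alignment. The `Node` dataclass and its field names are hypothetical stand-ins for the ENTITY, NEG, and MOD features that NLPWin provides, and the alignment is assumed to be a list of hard-aligned (h, t) pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    lemma: str
    is_entity: bool = False  # proper noun, quantity, or time
    neg: bool = False        # stands in for the NEG feature
    mod: bool = False        # stands in for the MOD feature


def unaligned_entity(h_nodes, alignment):
    """4.1: some hypothesis entity has no aligned text node."""
    aligned_h = {h for h, _ in alignment}
    return any(h.is_entity and h not in aligned_h for h in h_nodes)


def negation_mismatch(alignment):
    """4.2: exactly one node of an aligned pair is negated."""
    return any(h.neg != t.neg for h, t in alignment)


def modal_mismatch(alignment):
    """4.3: the text node carries a modal but the hypothesis node does not."""
    return any(t.mod and not h.mod for h, t in alignment)
```

Note that the modal test is asymmetric, following the text: a modal on the hypothesis side alone does not trigger it.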

4.4  Antonym  match 

If two aligned noun nodes (h_1, t_1) are both subjects or both
objects of verb nodes (h_0, t_0) in their respective sentences, i.e.,
REL(h_0, h_1) ∧ REL(t_0, t_1) ∧ REL ∈ {SUBJ, OBJ}, then we check for
a verb antonym match between (h_0, t_0). We construct the set of verb
antonyms using WordNet; we consider the antonyms of h_0 to be the
union of the antonyms of the first three senses of LEMMA(h_0), or of
the nearest antonym-possessing hypernyms if those senses do not
themselves have antonyms in WordNet. Explicitly, our procedure for
constructing the antonym set of a node h_0 is as follows:

1. ANTONYMS(h_0) = {}

2. For each of the first three listed senses s of LEMMA(h_0) in
WordNet:

(a) While |WN-ANTONYMS(s)| = 0

i. s ← WN-HYPERNYM(s)

(b) ANTONYMS(h_0) ← ANTONYMS(h_0) ∪ WN-ANTONYMS(s)

3. return ANTONYMS(h_0)

In addition to the verb antonyms in WordNet, we detect the
prepositional antonym pairs {before/after, to/from, over/under}. This
heuristic allows us to predict false entailment between “Black holes
can lose mass...” and “Black holes can regain some of their mass...”
in Test Ex. #1445.
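The antonym-set procedure can be sketched as below, with dictionaries standing in for WordNet lookups. The function name, the sense identifiers such as "regain#1", and the toy entries are all made up for illustration. One detail is an addition of this sketch: the walk up the hypernym chain stops when a sense has no hypernym, which the procedure as stated leaves unspecified.

```python
def antonyms(lemma, senses, wn_antonyms, wn_hypernym, max_senses=3):
    """Antonyms of the first `max_senses` senses, or of their nearest
    antonym-possessing hypernyms."""
    result = set()
    for s in senses.get(lemma, [])[:max_senses]:
        # Climb hypernyms until a sense with antonyms is found.
        # Guard added here: give up if the chain runs out.
        while s is not None and not wn_antonyms.get(s):
            s = wn_hypernym.get(s)
        if s is not None:
            result |= set(wn_antonyms[s])
    return result


# Toy tables (hypothetical, not real WordNet content).
SENSES = {"regain": ["regain#1", "regain#2"]}
WN_ANTONYMS = {"regain#1": set(), "recover#1": {"lose"}}
WN_HYPERNYM = {"regain#1": "recover#1"}
```

On these toy tables, `antonyms("regain", SENSES, WN_ANTONYMS, WN_HYPERNYM)` yields `{"lose"}`, so a text verb with lemma "lose" would satisfy LEMMA(t_0) ∈ ANTONYMS(h_0).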

4.5  Argument  movement 

For any two aligned verb nodes (h_1, t_1), we consider each noun
child h_2 of h_1 possessing any of the subject, object, or indirect
object relations to h_1, i.e., there exists REL(h_1, h_2) such that
REL ∈ {SUBJ, OBJ, IND}. If there is some node t_2 such that
ALIGN(h_2, t_2), but REL(t_1, t_2) ≠ REL(h_1, h_2), then we predict
that the entailment is false.

Figure 3: Example of object movement signaling false entailment

As an example, consider Figure 3, representing subgraphs from Dev Ex.
#1916:

T: ...U.N. officials are also dismayed that Aristide killed a
conference called by Prime Minister Robert Malval...

H: Aristide kills Prime Minister Robert Malval.

Here let (h_1, t_1) correspond to the aligned verbs with lemma kill,
where the object h_2 of h_1 has lemma Prime Minister Robert Malval,
and the object of t_1 has lemma conference. Since h_2 is aligned to
some node t_2 in the text graph, but ¬OBJ(t_1, t_2), the sentence
pair is rejected as a false entailment.
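A minimal sketch of the argument movement check, representing each graph as a collection of (relation, parent, child) triples and the alignment as a mapping from hypothesis nodes to text nodes. These representations and the function name are assumptions of the example; the part-of-speech restrictions (verb parent, noun child) are omitted for brevity.

```python
ARG_RELS = {"SUBJ", "OBJ", "IND"}


def argument_movement(h_triples, t_triples, alignment):
    """True if an aligned hypothesis argument does not bear the same
    relation to the aligned verb in the text."""
    t_set = set(t_triples)
    for rel, h1, h2 in h_triples:
        if rel not in ARG_RELS:
            continue
        # Both the verb and its argument must be aligned to text nodes.
        if h1 in alignment and h2 in alignment:
            if (rel, alignment[h1], alignment[h2]) not in t_set:
                return True
    return False


# Dev Ex. #1916, with relation labels for the text guessed for illustration.
H = [("SUBJ", "kill", "Aristide"), ("OBJ", "kill", "Malval")]
T = [("SUBJ", "kill", "Aristide"), ("OBJ", "kill", "conference"),
     ("SUBJ", "call", "Malval"), ("OBJ", "call", "conference")]
ALIGNMENT = {"kill": "kill", "Aristide": "Aristide", "Malval": "Malval"}
```

Here `argument_movement(H, T, ALIGNMENT)` is true, because OBJ(kill, Malval) holds in the hypothesis while no OBJ relation links the aligned nodes in the text.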

4.6  Superlative  mismatch 

If some adjective node h_1 in the hypothesis is identified as a
superlative, check that all of the following conditions are
satisfied:

1. h_1 is aligned to some superlative t_1 in the text sentence.

2. The noun phrase h_2 modified by h_1 is aligned to the noun phrase
t_2 modified by t_1.

3. Any additional modifier t_3 of the noun phrase t_2 is aligned to
some modifier h_3 of h_2 in the hypothesis sentence (reverse subset
match).

If any of these conditions is not satisfied, we predict that the
entailment is false. This heuristic allows us to predict false
entailment in (Dev Ex. #908):

T: Time Warner is the world’s largest media and Internet company.

H: Time Warner is the world’s largest company.

Here “largest media and Internet company” in T fails the reverse
subset match (condition 3) to “largest company” in H.
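The three conditions can be sketched over a simplified view of each noun phrase as a dictionary with a superlative, a head, and a set of modifiers. This flattened representation, the function name, and the treatment of “world”, “media”, and “Internet” as modifiers are assumptions of the example, not the NLPWin logical form.

```python
def superlative_mismatch(h_np, t_nps, aligned):
    """True if the hypothesis superlative phrase has no text phrase
    satisfying conditions 1 to 3."""
    if h_np.get("superlative") is None:
        return False  # the heuristic only applies to superlatives
    for t_np in t_nps:
        if t_np.get("superlative") is None:
            continue
        # Condition 1: the superlatives are aligned.
        if not aligned(h_np["superlative"], t_np["superlative"]):
            continue
        # Condition 2: the modified noun phrases are aligned.
        if not aligned(h_np["head"], t_np["head"]):
            continue
        # Condition 3: every text modifier is matched in the hypothesis.
        if all(any(aligned(h3, t3) for h3 in h_np["modifiers"])
               for t3 in t_np["modifiers"]):
            return False
    return True


T_NP = {"superlative": "largest", "head": "company",
        "modifiers": {"world", "media", "Internet"}}
H_NP = {"superlative": "largest", "head": "company",
        "modifiers": {"world"}}
```

With lemma equality as the alignment test, `superlative_mismatch(H_NP, [T_NP], lambda a, b: a == b)` is true, since “media” and “Internet” have no counterpart in the hypothesis.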

4.7  Conditional  mismatch 

For any pair of aligned nodes (h_1, t_1), if there exists a second
pair of aligned nodes (h_2, t_2) such that the shortest path
PATH(t_1, t_2) in the dependency graph T contains the conditional
relation, then PATH(h_1, h_2) must also contain the conditional
relation, or else we predict that the entailment is false. For
example, consider the following false entailment (Dev Ex. #60):

T: If a Mexican approaches the border, he’s assumed to be trying to
illegally cross.

H: Mexicans continue to illegally cross border.

Here, “Mexican” and “cross” are aligned, and the path between them in
the text contains the conditional relation, but does not in the
hypothesis; thus the entailment is predicted to be false.
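A sketch of the conditional mismatch test using breadth-first search for the shortest path. Graphs are assumed to be lists of (relation, node, node) edges traversed without regard to direction, and the edge labels in the example are guessed for illustration; a missing hypothesis path is treated as not containing the conditional relation.

```python
from collections import deque
from itertools import combinations


def path_relations(edges, start, goal):
    """Relation labels along a shortest path, or None if unreachable."""
    adj = {}
    for rel, a, b in edges:
        adj.setdefault(a, []).append((b, rel))
        adj.setdefault(b, []).append((a, rel))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, rels = queue.popleft()
        if node == goal:
            return rels
        for nxt, rel in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, rels + [rel]))
    return None


def conditional_mismatch(h_edges, t_edges, alignment, cond="COND"):
    """True if some aligned pair of pairs has COND on the text path only."""
    for (h1, t1), (h2, t2) in combinations(alignment.items(), 2):
        t_path = path_relations(t_edges, t1, t2)
        h_path = path_relations(h_edges, h1, h2)
        if t_path is not None and cond in t_path:
            if h_path is None or cond not in h_path:
                return True
    return False


# Dev Ex. #60 with hypothetical edge labels.
T_EDGES = [("COND", "assume", "approach"), ("SUBJ", "approach", "Mexican"),
           ("OBJ", "assume", "try"), ("OBJ", "try", "cross")]
H_EDGES = [("SUBJ", "continue", "Mexican"), ("OBJ", "continue", "cross")]
```

With `{"Mexican": "Mexican", "cross": "cross"}` as the alignment, the text path passes through the COND edge while the hypothesis path does not, so the pair is rejected.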

4.8  Other  heuristics  for  false  entailment 

In addition to these heuristics, we implemented an IS-A mismatch
heuristic, which attempted to discover when an IS-A relation in the
hypothesis sentence was not implied by a corresponding IS-A relation
in the text; however, this heuristic yielded a loss in accuracy on
the development set and was therefore not included in our final
system.

5  Lexical  similarity  and  paraphrase 
detection 

5.1  Lexical  similarity  using  MindNet 

In case none of the preceding heuristics for rejection are
applicable, we back off to a lexical similarity model similar to that
described in (Glickman et al., 2005). For every content node h ∈ H
not already aligned by one of the heuristics in Section 3, we obtain
a similarity score MN(h, t) from a similarity database that is
constructed automatically from the data contained in MindNet⁵ as
described in (Richardson, 1997). Our similarity function is thus:

$$\mathrm{sim}(h,t) = \begin{cases} 1 & \text{if } \mathrm{ANY\text{-}ALIGN}(h,t) \\ \mathrm{MN}(h,t) & \text{if } \mathrm{MN}(h,t) > min \\ min & \text{otherwise} \end{cases}$$

where the minimum score min is a parameter tuned for maximum accuracy
on the development set; min = 0.00002 in our final system. We then
compute the entailment score:


$$\mathrm{score}(H,T) = \Big( \prod_{h \in H} \max_{t \in T} \mathrm{sim}(h,t) \Big)^{1/n}$$


This approach is identical to that used in (Glickman et al., 2005),
except that we use alignment heuristics and MindNet similarity scores
in place of their web-based estimation of lexical entailment
probabilities, and we take as our score the geometric mean of the
component entailment scores rather than the unnormalized product of
probabilities.
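The back-off score can be sketched as follows, taking n to be the number of hypothesis nodes in the product so that the result is a geometric mean. The function names and the callables `any_align` and `mn` (standing in for the alignment test and the MindNet similarity lookup) are assumptions of this example, and the similarity value 0.25 used below is invented.

```python
import math


def sim(h, t, any_align, mn, min_score=0.00002):
    """1 for aligned nodes, otherwise the MindNet-style score floored at min."""
    if any_align(h, t):
        return 1.0
    return max(mn(h, t), min_score)


def entailment_score(h_nodes, t_nodes, any_align, mn, min_score=0.00002):
    """Geometric mean over hypothesis nodes of the best match in the text."""
    best = [max(sim(h, t, any_align, mn, min_score) for t in t_nodes)
            for h in h_nodes]
    # Sum of logs avoids underflow; every term is at least min_score > 0.
    return math.exp(sum(math.log(b) for b in best) / len(best))
```

For a hypothesis with nodes "Muslim" and "monotheistic" and a text with nodes "Muslim" and "God", an exact-match alignment and an invented score MN(monotheistic, God) = 0.25 give a geometric mean of about 0.5. Using the geometric mean rather than the raw product keeps scores comparable across hypotheses of different lengths.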

5.2  Measuring  phrasal  similarity  using  the  web 

The methods discussed so far for alignment are limited to aligning
pairs of single words or multiple-word units constituting single
syntactic categories; these are insufficient for the problem of
detecting more complicated paraphrases. For example, consider the
following true entailment (Dev Ex. #496):

T: ...Muslims believe there is only one God.

H: Muslims are monotheistic.

Here we would like to align the hypothesis phrase “are monotheistic”
to the text phrase “believe there is only one God”; unfortunately,
single-node alignment aligns only the nodes with lemma “Muslim”. In
this section we describe the approach used in our system to
approximate phrasal similarity via distributional information
obtained using the MSN Search search engine.

We propose a metric for measuring phrasal similarity based on a
phrasal version of the distributional hypothesis: we propose that a
phrase template P_h (e.g. “x_h are monotheistic”) has high semantic
similarity to a template P_t (e.g. “x_t believe there is only one
God”), with possible “slot-fillers” x_h and x_t, respectively, if the
overlap of the sets of observed slot-fillers X_h ∩ X_t for those
phrase templates is high in some sufficiently large corpus (e.g., the
Web).

To measure phrasal similarity we issue the surface text form of each
candidate phrase template as a query to a web-based search engine,
and parse the returned sentences in which the candidate phrase occurs
to determine the appropriate slot-fillers. For example, in the above
example, we observe the set of slot-fillers X_t = {Muslims,
Christians, Jews, Saivities, Sikhs, Caodaists, People}, and
X_h ∩ X_t = {Muslims, Christians, Jews, Sikhs, People}.

Explicitly, given the text and hypothesis logical forms, our
algorithm proceeds as follows to compute the phrasal similarity
between all phrase templates in H and T:

1. For each pair of aligned single node and unaligned leaf node
(t_1, t_l) (or pair of aligned nodes (t_1, t_2)) in the text T:

(a) Use NLPWin to generate a surface text string S from the
underlying logical form PATH(t_1, t_2).

(b) Create the surface string template phrase P_t by removing from S
the lemmas corresponding to t_1 (and t_2, if path is between aligned
nodes).

(c) Perform a web search for the string P_t.

(d) Parse the resulting sentences containing P_t and extract all
non-pronoun slot fillers x_t ∈ X_t that satisfy the same syntactic
roles as t_1 in the original sentence.

2. Similarly, extract the slot fillers X_h for each discovered phrase
template P_h in H.

3. Calculate paraphrase similarity as a function of the overlap
between the slot-filler sets X_t and X_h, i.e.:

$$\mathrm{score}(P_h, P_t) = \frac{|X_h \cap X_t|}{|X_t|}$$

We then incorporate paraphrase similarity within the lexical
similarity model by allowing, for some unaligned node h ∈ P_h, where
t ∈ P_t:

$$\mathrm{sim}(h,t) = \max(\mathrm{MN}(h,t),\ \mathrm{score}(P_h, P_t))$$

5 http://research.microsoft.com/mnex
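The overlap-based paraphrase score and its combination with the lexical score can be sketched as below. The sketch assumes the overlap is normalized by the size of the text-side slot-filler set X_t; the function names are hypothetical, and returning 0 for an empty X_t is a choice made here rather than something the text states.

```python
def paraphrase_score(x_h, x_t):
    """Fraction of text-side slot fillers also observed for the hypothesis."""
    x_h, x_t = set(x_h), set(x_t)
    if not x_t:
        return 0.0  # assumption: no evidence means no paraphrase support
    return len(x_h & x_t) / len(x_t)


def sim_with_paraphrase(mn_score, phrase_score):
    """Take the better of the lexical and the phrasal similarity."""
    return max(mn_score, phrase_score)


# Slot fillers reported for Dev Ex. #496; only the intersection is given
# for the hypothesis side, so it is used directly here.
X_T = {"Muslims", "Christians", "Jews", "Saivities", "Sikhs",
       "Caodaists", "People"}
X_H_AND_T = {"Muslims", "Christians", "Jews", "Sikhs", "People"}
```

Under this normalization the example above scores 5/7, which would replace a lower MindNet score for the unaligned node.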
Our approach to paraphrase detection is most similar to the TEASE
algorithm (Szpektor et al., 2004), and bears similarity to both DIRT
(Lin and Pantel, 2001) and KnowItAll (Etzioni et al., 2004). The
chief difference in our algorithm is that we generate the surface
text search strings from the parsed logical forms using the
generation capabilities of NLPWin (Aikawa et al., 2001), and we
verify that the syntactic relations in each discovered web snippet
are isomorphic to those in the original candidate paraphrase
template.

6  Results  and  Discussion 

In this section we present the final results of our system on the
PASCAL RTE-1 test set, and examine our features in an ablation study.
The PASCAL RTE-1 development and test sets consist of 567 and 800
examples, respectively, with the test set split equally between true
and false examples.

6.1  Results  and  Performance  Comparison  on 
the  PASCAL  RTE-1  Test  Set 

Table 2 displays the accuracy and confidence-weighted score⁶ (CWS) of
our final system on each of the tasks for both the development and
test sets.

Our overall test set accuracy of 62.50% represents a 2.1% absolute
improvement over the task-independent system described in (Tatu and
Moldovan, 2005), and a 20.2% relative improvement in accuracy over
their system with respect to an uninformed baseline accuracy of 50%.
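As a quick check of these two figures (plain arithmetic on the reported accuracies, measuring the relative gain as improvement over the 50% baseline):

$$62.5\% - 60.4\% = 2.1\ \text{points (absolute)}, \qquad \frac{62.5 - 50}{60.4 - 50} - 1 = \frac{12.5}{10.4} - 1 \approx 0.202\ \ (20.2\%\ \text{relative}).$$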

To compute confidence scores for our judgments, any entailment
determined to be false by any heuristic was assigned maximum
confidence; no attempts were made to distinguish between entailments
rejected by different heuristics. The confidence of all other
predictions was calculated as the absolute value of the difference
between the output score(H, T) of the lexical similarity model and
the threshold t = 0.1285 as tuned for highest accuracy on our
development set. We would expect a higher CWS to result from learning
a more appropriate confidence function; nonetheless our overall test
set CWS of 0.6534 is higher than previously-reported task-independent
systems (however, the task-dependent system reported in (Raina et
al., 2005) achieves a CWS of 0.686).

            Dev Set             Test Set
Task    acc      cws        acc      cws
CD      0.8061   0.8357     0.7867   0.8261
RC      0.5534   0.5885     0.6429   0.6476
IR      0.6857   0.6954     0.6000   0.6571
MT      0.7037   0.7145     0.6000   0.6350
IE      0.5857   0.6008     0.5917   0.6275
QA      0.7111   0.7121     0.5308   0.5463
PP      0.7683   0.7470     0.5200   0.5333
All     0.6878   0.6888     0.6250   0.6534

Table 2: Summary of accuracies and confidence-weighted scores, by
task

Alignment Feature           Dev      Test
Synonym Match               0.0106   0.0038
Derivational Form           0.0053   0.0025
Paraphrase                  0.0053   0.0000
Lexical Similarity          0.0053   0.0000
Value Match                 0.0017   0.0013
Acronym Match               0.0017   0.0013
Adjectival Form⁷            0.0000   0.0063

False Entailment Feature    Dev      Test
Negation Mismatch           0.0106   0.0025
Argument Movement           0.0070   0.0250
Conditional Mismatch        0.0053   0.0037
Modal Mismatch              0.0035   0.0013
Superlative Mismatch        0.0035   -0.0025
Entity Mismatch             0.0018   0.0063

Table 3: Feature ablation study; quantity is the accuracy loss
obtained by removal of single feature

6 As in (Dagan et al., 2005) we compute the confidence-weighted score
(or “average precision”) over n examples {c_1, c_2, ..., c_n} ranked
in order of decreasing confidence as

$$\mathrm{cws} = \frac{1}{n} \sum_{i=1}^{n} \frac{\#\text{correct-up-to-rank-}i}{i}$$
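The confidence assignment and the confidence-weighted score can be sketched as follows. The function names are hypothetical, and representing “maximum confidence” as positive infinity is an assumption of this example; ties in confidence keep their input order.

```python
def confidence(rejected_by_heuristic, score, threshold=0.1285):
    """Maximum confidence for heuristic rejections, else distance to threshold."""
    if rejected_by_heuristic:
        return float("inf")  # assumption: any value above all others would do
    return abs(score - threshold)


def confidence_weighted_score(judgments):
    """CWS over (confidence, is_correct) pairs.

    Averages, over ranks i = 1..n in order of decreasing confidence,
    the fraction of correct judgments among the top i.
    """
    ranked = sorted(judgments, key=lambda j: j[0], reverse=True)
    correct_so_far = 0
    total = 0.0
    for i, (_, is_correct) in enumerate(ranked, start=1):
        correct_so_far += bool(is_correct)
        total += correct_so_far / i
    return total / len(ranked)
```

For three judgments ranked correct, incorrect, correct, the score is (1/1 + 1/2 + 2/3) / 3, roughly 0.72; a system whose errors are concentrated among its least confident judgments scores higher than its plain accuracy.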

6.2  Feature  analysis 

Table  3  displays  the  results  of  our  feature  ablation 
study,  analyzing  the  individual  effect  of  each  feature. 

Of the seven heuristics used in our final system for node alignment
(including lexical similarity and paraphrase detection), our ablation
study showed that five were helpful in varying degrees on our test
set, but that removal of either MindNet similarity scores or
paraphrase detection resulted in no accuracy loss on the test set.

7 As discussed in Section 2, features with no effect on development
set accuracy were included in the system if and only if they improved
the system’s unweighted F-score.

Of the six false entailment heuristics used in the final system, five
resulted in an accuracy improvement on the test set (the most
effective by far was the “Argument Movement” heuristic, resulting in
a net gain of 20 correctly-classified false examples); inclusion of
the “Superlative Mismatch” feature resulted in a small net loss of
two examples.

We note that our heuristics for false entailment, where applicable,
were indeed significantly more accurate than our final system as a
whole; on the set of examples predicted false by our heuristics we
had 71.3% accuracy on the training set (112 correct out of 157
predicted), and 72.9% accuracy on the test set (164 correct out of
225 predicted).

7  Conclusion 

In this paper we have presented and analyzed a system for recognizing
textual entailment focused primarily on the recognition of false
entailment, and demonstrated higher performance than achieved by
previous approaches on the widely-used PASCAL RTE test set. Our
system achieves state-of-the-art performance despite not exploiting a
wide array of sources of knowledge used by other high-performance
systems; we submit that the performance of our system demonstrates
the unexploited potential in features designed specifically for the
recognition of false entailment.